Treatment Heterogeneity and
Individual Qualitative Interaction

ABSTRACT 

Plausibility  of  high  variability  in  treatment  effects  across  individuals  has  been  recognized 
as  an  important  consideration  in  clinical  studies.  Surprisingly,  little  attention  has  been  given  to 
evaluating  this  variability  in  design  of  clinical  trials  or  analyses  of  resulting  data.  High  variation 
in  a  treatment’s  efficacy  or  safety  across  individuals  (referred  to  herein  as  treatment 
heterogeneity)  may  have  important  consequences  because  the  optimal  treatment  choice  for  an 
individual  may  be  different  from  that  suggested  by  a  study  of  average  effects.  We  call  this  an 
individual  qualitative  interaction  (IQI),  borrowing  terminology  from  earlier  work  -  referring  to  a 
qualitative  interaction  (QI)  being  present  when  the  optimal  treatment  varies  across  “groups”  of 
individuals.  At  least  three  techniques  have  been  proposed  to  investigate  treatment  heterogeneity: 
techniques  to  detect  a  QI,  use  of  measures  such  as  the  density  overlap  of  two  outcome  variables 
under  different  treatments,  and  use  of  cross-over  designs  to  observe  “individual  effects.”  We 
elucidate  underlying  connections  among  them,  their  limitations  and  some  assumptions  that  may 
be  required.  We  do  so  under  a  potential  outcomes  framework  that  can  add  insights  to  results 
from  usual  data  analyses  and  to  study  design  features  that  improve  the  capability  to  more  directly 
assess  treatment  heterogeneity. 

KEY  WORDS:  Causation;  Crossover  interaction;  Individual  effects;  Potential  outcomes; 
Probability  of  similar  response;  Subject-treatment  interaction. 


1.  INTRODUCTION 


“...it appears that white sheep and pigs are injured by certain plants, whilst dark-coloured
individuals escape.” ~ Charles Darwin


“What is food to one to some becomes fierce poison” ~ Lucretius


The  quotations  above  illustrate  that  individual  differences  in  response  to  stimuli  or 
‘treatments’  have  been  the  subject  of  interest  throughout  recorded  history.  They  further  illustrate 
two  kinds  of  interactions.  Darwin  points  out  an  interaction  in  which  one  type  of  animal  is  harmed 
by a certain treatment whereas other animals are not harmed, but are not necessarily helped. In
contrast,  Lucretius  points  out  a  more  dramatic  type  of  interaction  in  which  what  is  helpful  to 
some  is  actually  harmful  to  others.  More  formally,  treatment  heterogeneity  is  present  when  the 
effect  of  a  treatment,  say  T,  with  respect  to  a  reference  treatment,  R,  varies  across  subsets  or 
individuals  in  a  population.  At  the  individual  level,  this  variability  is  called  subject-treatment 
interaction  (Gadbury  2010).  A  consequence  of  this  heterogeneity  is  that  different  individuals  or 
subsets  may  respond  to  treatment  in  opposite  directions,  with  treatment  T  having  higher  efficacy 
for  some  and  treatment  R  having  higher  efficacy  for  others.  The  term  qualitative  interaction  (QI) 
has  been  used  to  describe  this  situation  at  the  subset  level  (Peto  1982;  Gail  and  Simon  1985),  and 
tests  have  been  developed  to  detect  a  QI  (Gail  and  Simon  1985;  Silvapulle  2001;  Li  and  Chan 
2006).  When  such  tests  are  significant,  optimal  treatments  may  differ  among  subsets  (Byar  and 
Corle  1977).  A  “quantitative”  interaction  (Peto  1982)  exists  when  the  magnitudes  of  the  effects 
of  a  treatment  differ  across  subsets,  but  are  in  the  same  direction.  Herein  we  refer  to  a  “subset 
interaction”  as  a  more  general  term  that  includes  quantitative  and/or  qualitative  interactions. 

Taking  the  idea  of  subsets  to  its  limit,  we  can  recognize  that  each  person  is  unique  and 
can  be  considered  their  own  subset.  Then  analogous  to  the  QI  described  above,  an  individual 
qualitative  interaction  (IQI)  is  present  when  at  least  two  individuals  respond  in  the  opposite 
direction  to  treatment.  A  fact  that  initially  seems  counter-intuitive  to  many  clinical  investigators 
who  are  used  to  discussing  ‘non-responders’  in  standard  clinical  trials  (e.g.,  Inoue  et  al.  2010),  is 
that  individual  effects  of  a  treatment  T  with  respect  to  R  are  inherently  unobservable  in  a  two 
treatment  comparison  study  because  only  one  of  the  two  outcome  variables  is  observable  for 
each  subject,  depending  on  the  treatment  assigned  to  that  subject.  Let  potential  outcome 
variables  (Rubin,  1974)  to  treatments  T  and  R  be  given  by  X  and  Y,  respectively,  with  an 
individual effect defined by the variable $D = X - Y$. Suppose, as in Gadbury and Iyer (2000),
that the potential outcome variables, $(X, Y)'$, are modeled by a bivariate population distribution
with mean $(\mu_X, \mu_Y)'$ and covariance matrix

$$\begin{pmatrix} \sigma_X^2 & \sigma_{XY} \\ \sigma_{XY} & \sigma_Y^2 \end{pmatrix},$$

where the covariance $\sigma_{XY} = \sigma_X \sigma_Y \rho_{XY}$.

The distribution of $D$ then has mean $\mu_D = \mu_X - \mu_Y$ and variance

$$\sigma_D^2 = \mathrm{Var}(X - Y) = \sigma_X^2 + \sigma_Y^2 - 2\sigma_X \sigma_Y \rho_{XY}. \qquad (1)$$

In a two treatment comparison study where subjects are randomly assigned to treatments T and
R, the mean treatment effect, $\mu_D$, can be estimated but the variance, $\sigma_D^2$, cannot because there is
no information in observable data to estimate the correlation, $\rho_{XY}$.

Assume throughout that $\mu_D = E(X - Y) = \mu_X - \mu_Y > \tau$ represents a beneficial average
effect of T over R, where $\tau$ is some threshold particular to the treatments being compared
(that is, treatment T may have costs associated with it over treatment R that require a sufficiently
positive value for $\mu_D$ before it is claimed that T is a preferred treatment over R). Hereafter, for
convenience, we take $\tau = 0$. Subject-treatment interaction is present in the population when
$\sigma_D^2 > 0$. If there is no interaction at the individual level, $\sigma_D^2 = 0$, and there is a constant
treatment effect (Holland 1986). However, as individual treatment effects become more
heterogeneous, $\sigma_D^2$ gets larger, and a positive proportion of individuals in the population with a
value of $D$ less than zero (i.e., an IQI is present) becomes more plausible, despite $\mu_D > 0$. We
denote the proportion of individuals having an unfavorable outcome to treatment T versus R as
PIQI. If the bivariate distribution given above is normal, then

$$\mathrm{PIQI} = \Phi\left(-\frac{\mu_D}{\sigma_D}\right), \qquad (2)$$

where $\Phi(\cdot)$ is the standard normal cumulative distribution function (CDF). Normality is
assumed here for convenience, but it is not required for definition of the PIQI.
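To make the role of the nonestimable correlation concrete, the following minimal sketch evaluates equations (1) and (2) over a grid of values of $\rho_{XY}$. The function names and the numerical inputs are illustrative assumptions, not quantities from the study; only the Python standard library is used.

```python
from statistics import NormalDist


def sigma_d(sigma_x, sigma_y, rho):
    """Standard deviation of D = X - Y, from equation (1)."""
    var_d = sigma_x**2 + sigma_y**2 - 2.0 * sigma_x * sigma_y * rho
    return max(var_d, 0.0) ** 0.5  # guard against tiny negative round-off


def piqi(mu_d, sigma_x, sigma_y, rho):
    """P(D < 0) under bivariate normality, from equation (2)."""
    s = sigma_d(sigma_x, sigma_y, rho)
    if s == 0.0:
        # Constant treatment effect: D equals mu_d for every individual.
        return 1.0 if mu_d < 0 else 0.0
    return NormalDist().cdf(-mu_d / s)


# Hypothetical inputs chosen only for illustration.
mu_d, sigma_x, sigma_y = 1.0, 2.0, 2.0
for rho in (-1.0, 0.0, 0.5, 0.9, 1.0):
    print(f"rho={rho:+.1f}  sigma_D={sigma_d(sigma_x, sigma_y, rho):.3f}  "
          f"PIQI={piqi(mu_d, sigma_x, sigma_y, rho):.3f}")
```

With equal marginal standard deviations, the PIQI falls to zero as the correlation approaches 1, even though the marginal distributions, and hence the observable data, are unchanged.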

It has been remarked that medicine today generally makes use of statistical information
gathered about the general population (often about the “average” subject) and then applies it to
the individual (cf. Marshall, 1997). Some have suggested that information about a mean
treatment effect be supplemented by information about treatment effect heterogeneity (cf.,
Longford, 1999). This paper explores methods to determine whether $\sigma_D^2$ and/or PIQI are positive
using  observed  data  from  a  two  treatment  comparison  study  where  treatments  are  randomly 
assigned.  In  particular,  we  discuss  three  approaches:  tests  for  subset  interaction,  the  proportion  of 
similar  responses  (PSR),  and  cross-over  designs.  We  review  each  method  using  the  potential 
outcomes  structure  to  highlight  important  connections  and  assumptions.  A  data  example  is  used 
to  illustrate  the  ideas.  Though  the  focus  here  is  a  clinical  setting,  interest  in  other  aspects  of  a 
treatment  effect  distribution  besides  the  mean  has  emerged  from  other  fields.  For  example,  see 
Fan  and  Park,  2009,  2010,  for  application  in  econometrics. 

It  is  recommended  that  new  ideas  for  clinical  trial  designs  and  methodologies  be  pursued 
that  may  lead  to  further  improvements  in  our  ability  to  estimate  and  test  aspects  of  individual 
treatment  response  heterogeneity.  We  offer  potential  outcomes  as  a  useful  framework  for 
understanding  individual  treatment  heterogeneity  and  its  consequences.  It  helps  to  distinguish 
heterogeneity  that  is  “explainable”  in  observed  data  from  unexplained  heterogeneity.  Though 
subject-treatment  interaction  and  the  PIQI  cannot  be  directly  estimated  in  observed  data  without 
introducing  additional  assumptions,  bounds  can  be  estimated.  Consequences  of  unexplained 
heterogeneity,  reflected  by  estimable  bounds  for  the  PIQI,  can  alert  investigators  to  the  possible 
existence  of  an  unobserved  covariate  that  could  be  potentially  predictive  of  individual  success  to 
a  treatment  application. 

2. AN ILLUSTRATIVE EXAMPLE


Bookstores  typically  devote  shelf  space  to  a  wide  variety  of  dieting  books,  each  book 
frequently  containing  anecdotes  describing  substantial  weight  loss  and  other  remarkable 
improvements  to  health  for  particular  individuals.  Obesity  researchers  are  cautious  about 
embracing  various  new  diets  due  to  limited  evidence  that  they  outperform  more  traditional 
programs  of  weight  loss.  Two  diets  whose  relative  merits  have  been  discussed  are  the  low 
carbohydrate  diet  (sometimes  called  a  reduced-glycemic-load  diet)  versus  a  more  traditional 
portion  controlled  low-fat  diet.  Results  from  clinical  studies  comparing  these  two  have  been 
inconsistent,  with  some  suggesting  that  one  appears  more  effective  for  weight  loss  at  some  time 
points,  on  average,  and  others  that  show  no  significant  difference.  A  March  4,  2010  article  in  the 
Wall  Street  Journal  reported  on  an  unpublished  study  by  Stanford  University  researchers  that 
suggested  that  individual  differences  in  genetic  predispositions  contribute  to  substantial 
individual  differences  in  the  relative  efficacy  of  one  diet  versus  another  (i.e.,  low  fat  versus  low- 
carbohydrate)  among  overweight  women. 

Given the possibility of treatment heterogeneity in such a study, as well as an IQI, we
illustrate  the  ideas  discussed  here  using  a  data  set  that  compared  a  reduced-glycemic-load  diet 
(RGL)  and  a  portion  controlled  low  fat  diet.  The  data  are  a  subset  of  data  analyzed  and  reported 
in  Maki  et  al.,  2007.  Subjects  were  randomized  to  two  treatments:  T  =  the  RGL  diet  (n  =  43)  and 
R  =  the  low  fat  diet  (n  =  43).  A  primary  outcome  variable  was  weight  change  from  baseline  at  12 
weeks,  measured  in  kilograms.  Maki  et  al.,  2007,  also  report  analyses  of  other  outcome  variables 
such as waist circumference, fat free mass, and results from laboratory tests. Several covariates
were  measured  such  as  baseline  values  of  outcome  variables,  age,  race,  and  gender.  Using 
notation  described  earlier,  we  consider  the  outcomes  X  =  weight  change  from  baseline  at  12 
weeks  for  subjects  assigned  to  treatment  T,  and  Y  =  weight  change  from  baseline  at  12  weeks  for 
subjects assigned to R. Positive values of X and Y are a weight “loss” (in kilograms) from
baseline. We analyze data for the 69 subjects (out of 86) that remained in the study at 12 weeks
(34 in treatment T and 35 in treatment R). For brevity and to focus on the topic presented herein,
we do not consider issues related to compliance or dropout and, initially, analyze data for only
the two outcomes, X and Y.

The treatment T group had a mean weight loss of $\hat{\mu}_X = 5.39$ kg with a standard deviation
of $\hat{\sigma}_X = 3.16$ kg. The treatment R group had a mean weight loss of $\hat{\mu}_Y = 3.23$ kg with a standard
deviation of $\hat{\sigma}_Y = 3.20$ kg. An estimate of mean treatment effect is $\hat{\mu}_D = 2.16$ kg, and a t-test of
$H_0: \mu_D = 0$ against an upper tailed alternative hypothesis gives a p-value of 0.003.

Sometimes investigators will interpret unequal variance of the outcome variables in each
treatment group as evidence of treatment heterogeneity. This is, in fact, partially true because the
minimum bound for the subject-treatment interaction standard deviation is $\sigma_D^{\min} = |\sigma_X - \sigma_Y|$,
and this quantity can be large when $\sigma_X$ and $\sigma_Y$ are very different. Estimable bounds,
$(\sigma_D^{\min}, \sigma_D^{\max})$, are obtained by setting the correlation equal to 1 and -1, respectively (Gadbury and
Iyer, 2000). The maximum bound, $\sigma_D^{\max} = \sigma_X + \sigma_Y$, is not small unless both standard deviations
of the potential outcome variables are small. From the example data, the estimated bounds are
$\hat{\sigma}_D^{\min} = 0.04$, $\hat{\sigma}_D^{\max} = 6.36$. The estimated standard errors obtained by 2000 bootstrap samples
within treatment groups (cf., Gadbury et al., 2001) indicate that the lower bound is not different
from zero. The estimated standard error for $\hat{\sigma}_D^{\max}$ is 0.59. Based on only these outcome variables
and these estimated bounds, there is no clear and compelling evidence in the data that subject-treatment
interaction ‘must’ be present, but there is evidence that it ‘could’ be.
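The bootstrap within treatment groups mentioned above can be sketched as follows. Because the individual weight-loss measurements are not reproduced here, the sketch draws simulated stand-in data from normal distributions matching the reported group means and standard deviations; the data, the seeds, and the function names are assumptions made for illustration, so the resulting standard errors will not reproduce the published values.

```python
import random
import statistics


def sigma_bounds(x, y):
    """Estimated (min, max) bounds for sigma_D from the two observed samples."""
    sx, sy = statistics.stdev(x), statistics.stdev(y)
    return abs(sx - sy), sx + sy


def bootstrap_se(x, y, n_boot=2000, seed=0):
    """Bootstrap standard errors of both bounds, resampling within each group."""
    rng = random.Random(seed)
    lows, highs = [], []
    for _ in range(n_boot):
        xb = rng.choices(x, k=len(x))  # resample treatment T subjects
        yb = rng.choices(y, k=len(y))  # resample treatment R subjects
        lo_b, hi_b = sigma_bounds(xb, yb)
        lows.append(lo_b)
        highs.append(hi_b)
    return statistics.stdev(lows), statistics.stdev(highs)


# Simulated stand-in data (NOT the study data): 34 and 35 subjects per group.
data_rng = random.Random(1)
x = [data_rng.gauss(5.39, 3.16) for _ in range(34)]
y = [data_rng.gauss(3.23, 3.20) for _ in range(35)]

lo, hi = sigma_bounds(x, y)
se_lo, se_hi = bootstrap_se(x, y)
print(f"bounds: ({lo:.2f}, {hi:.2f}); bootstrap se: ({se_lo:.2f}, {se_hi:.2f})")
```

Resampling separately within each group respects the design: the two samples are independent, and no subject contributes both an X and a Y.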

Assuming as before a bivariate normal distribution for weight loss outcomes X, Y,
estimated bounds for the PIQI (Gadbury and Iyer, 2000) are given as,

$$\widehat{\mathrm{PIQI}}^{\min} = \Phi\left(-\frac{\hat{\mu}_D}{\hat{\sigma}_D^{\min}}\right) \approx 0, \qquad \widehat{\mathrm{PIQI}}^{\max} = \Phi\left(-\frac{\hat{\mu}_D}{\hat{\sigma}_D^{\max}}\right) = 0.37. \qquad (3)$$

The  bootstrap  standard  error  for  the  upper  bound  is  0.049.  These  results  suggest  that  the 
estimated  proportion  of  the  population  having  an  effect  of  treatment  T  versus  R  in  the  opposite 
direction  from  the  mean  effect  could  be  negligible  or  as  high  as  0.37.  One  may  argue  that  it  is 
more  plausible  that  this  proportion  is  closer  to  zero  rather  than  0.37  because  the  nonestimable 
correlation  between  potential  outcomes  should  be  closer  to  1  rather  than  -1.  Yet  without 
additional  information  or  assumptions,  little  more  can  be  said  about  treatment  heterogeneity  or 
its  consequences  based  on  these  data  alone. 
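The point estimates quoted in this section follow directly from the reported group summaries. The short sketch below recomputes the bounds for the interaction standard deviation and for the PIQI from those summaries alone, under the bivariate normal assumption; the variable names are ours.

```python
from statistics import NormalDist

# Group summaries reported in the text (kg of weight loss at 12 weeks).
mu_x, sd_x = 5.39, 3.16  # treatment T (RGL diet)
mu_y, sd_y = 3.23, 3.20  # treatment R (low fat diet)

mu_d = mu_x - mu_y            # estimated mean treatment effect
sigma_min = abs(sd_x - sd_y)  # correlation set to +1
sigma_max = sd_x + sd_y       # correlation set to -1

phi = NormalDist().cdf
piqi_min = phi(-mu_d / sigma_min)
piqi_max = phi(-mu_d / sigma_max)

print(f"mu_D = {mu_d:.2f} kg")
print(f"sigma_D bounds = ({sigma_min:.2f}, {sigma_max:.2f}) kg")
print(f"PIQI bounds = ({piqi_min:.2f}, {piqi_max:.2f})")
```

The width of the PIQI interval reflects only ignorance about the correlation between potential outcomes; sampling variability is handled separately through the bootstrap standard errors.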

3. METHODS FOR ASSESSING TREATMENT HETEROGENEITY AND ITS CONSEQUENCES

Other  techniques  have  been  developed  for  studying  population  treatment  heterogeneity 
and  its  consequences  under  different  assumptions  and  constraints.  For  example,  Hauck  et  al. 
(2000), in a slight variation of our current notation ($X$ and $Y$ are considered individual averages in
a repeated measures cross-over design), proposed $\sigma_D^2$ be used to determine whether T and R were
individually bioequivalent. In this section we use potential outcomes to clarify the assumptions
required to estimate $\sigma_D^2$ and a PIQI under three different strategies. First we establish a
connection  between  subset  interaction  and  subject-treatment  interaction  and  show  how  the 
former,  with  an  appropriate  design,  is  a  detectable  consequence  of  the  latter.  Second,  we  show 
that  the  proportion  of  similar  response  (PSR)  or  density  overlap,  though  intuitively  appealing, 
can  be  misleading  when  used  as  a  proxy  for  treatment  heterogeneity  and,  hence,  the  potential 
presence  of  IQI.  Finally,  we  show  that  additional  information  becomes  available  in  cross-over 
designs,  but  that  direct  estimation  of  the  PIQI  requires  further  assumptions. 

3.1 Identification of Subsets


The  study  of  subset  interaction  presupposes  a  covariate  that  is  a  “grouping  variable”  and 
some  degree  of  homogeneity  of  treatment  response  within  groups,  with  QI  then  explained  by 
differences  in  treatment  response  across  groups.  If  the  grouping  variable  is  continuous,  then 
groups  are  subpopulations  defined  by  values  of  the  covariate.  One  reason  for  subset  analysis  then 
is to identify “which treatment is best for which kinds of patients,” (Byar and Corle 1977, p. 455).
Standard methods seek to find such subsets through an investigation of interaction effects
(Byar  and  Corle  1977;  Simon  1982)  or  a  direct  test  for  a  qualitative  interaction  (Gail  and  Simon 
1985;  Silvapulle  2001;  Li  and  Chan  2006).  In  each  case  the  interaction  is  detectable  by  changes 
in  the  mean  response  across  subsets.  Using  potential  outcomes,  the  subject-treatment  interaction 
variance, $\sigma_D^2$, can be decomposed into an explainable component (i.e., a component that is
estimable)  and  an  unexplainable  component  (remaining  subject-treatment  interaction  within  a 
subset). 

3.1.1  A  continuous  covariate 

First consider, as in Gadbury et al. (2001), a continuous covariate $Z$ (i.e., not affected by
the treatment) that augments potential outcomes $(X, Y)$. Assume the distribution of $D$ given
$Z = z_0$ is normal with conditional mean

$$\mu_{D|Z=z_0} = \mu_X - \mu_Y + (\beta_{XZ} - \beta_{YZ})(z_0 - \mu_Z) \qquad (4)$$

and conditional variance,

$$\sigma_{D|Z}^2 = \sigma_{X|Z}^2 + \sigma_{Y|Z}^2 - 2\sigma_{X|Z}\sigma_{Y|Z}\rho_{XY|Z}. \qquad (5)$$

$\beta_{XZ}$ and $\beta_{YZ}$ in equation (4) are the slope coefficients between $X$ and $Z$ and $Y$ and $Z$, respectively,
and $\rho_{XY|Z}$ in equation (5) is the partial correlation between $X$ and $Y$, given $Z$. The conditional
variances, $\sigma_{X|Z}^2$ and $\sigma_{Y|Z}^2$, are allowed to be different across the two treatment groups but are
assumed to not depend on the value of $Z$. Gadbury et al. (2001) showed that,

$$\sigma_D^2 = (\sigma_{X|Z} - \sigma_{Y|Z})^2 + 2\sigma_{X|Z}\sigma_{Y|Z}(1 - \rho_{XY|Z}) + (\beta_{XZ} - \beta_{YZ})^2\sigma_Z^2$$

$$= \sigma_{D|Z}^2 + (\beta_{XZ} - \beta_{YZ})^2\sigma_Z^2. \qquad (6)$$

So $\sigma_D^2$ is comprised of two components, one that can be attributed to subset interaction
(the second term in (6)) and one that can be attributed to subject-treatment interaction, within
subsets (the first term in (6)). The quantity, $(\beta_{XZ} - \beta_{YZ})^2\sigma_Z^2$, can be estimated using the observed
data. When $\beta_{XZ} \neq \beta_{YZ}$, then $\sigma_{D|Z}^2$ within subpopulations is smaller than the unconditional
subject-treatment interaction variance, $\sigma_D^2$. The conditional proportion of IQI within the subset
(or subpopulation) defined by a value of the covariate $Z$, $\mathrm{PIQI}_{Z=z_0}$, is given by a quantity
analogous to that given in equation (2), except using the conditional mean and conditional
variance of $D$, given $Z$. Since $\mu_{D|Z=z_0}$ could be greater than $\mu_D$ when $\beta_{XZ} \neq \beta_{YZ}$, and both
$\sigma_{X|Z} \leq \sigma_X$ and $\sigma_{Y|Z} \leq \sigma_Y$, one could identify subsets of the population for which
$\mathrm{PIQI}^{\max}_{Z=z_0} < \mathrm{PIQI}^{\max}$.

Returning to the data example, let $Z$ = baseline weight. A test for a baseline-treatment
interaction is significant (p-value = 0.007), with $\hat{\beta}_{XZ} = 0.082$, $\hat{\beta}_{YZ} = -0.075$, and $\hat{\sigma}_Z^2 = 165.9$, so that
an estimated $|\hat{\beta}_{XZ} - \hat{\beta}_{YZ}|\,\hat{\sigma}_Z = 2.02$ kg of $\sigma_D$ is explained by the baseline weight covariate. Figure 1
is a plot of the data showing the interaction between treatment and baseline weight. The vertical
line is plotted at $\hat{\mu}_Z = \bar{z} = 89.6$ kg, where $\bar{z}$ is the mean of all 69 baseline weights. The
remaining unexplained subject-treatment interaction standard deviation within
subpopulations, $\sigma_{D|Z}$, can be bounded by quantities $(\sigma_{D|Z}^{\min}, \sigma_{D|Z}^{\max})$ that are estimable by setting the
nonestimable partial correlation in (6) equal to 1 and -1, respectively, and estimating the
conditional variances of $X$ and $Y$, given $Z$, using the mean squared error of models that regress $X$
on $Z$ and $Y$ on $Z$. This gives estimated bounds equal to 0.1 and 6.1.
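As a worked check of the explained component in equation (6) and the conditional mean in equation (4), the sketch below plugs in the estimates reported for the baseline weight covariate. The helper name `conditional_mean_effect` is ours, and because the reported slopes are rounded, conditional means away from the average baseline weight will agree with published values only approximately.

```python
import math

# Estimates reported in the text for Z = baseline weight (kg).
beta_xz, beta_yz = 0.082, -0.075  # slopes of X on Z and of Y on Z
var_z = 165.9                     # estimated variance of Z
mu_d = 2.16                       # estimated mean treatment effect (kg)
z_bar = 89.6                      # mean baseline weight (kg)

# Portion of sigma_D (on the sd scale) explained by the covariate, eq. (6).
explained_sd = abs(beta_xz - beta_yz) * math.sqrt(var_z)


def conditional_mean_effect(z0):
    """Estimated mean treatment effect at Z = z0, from equation (4)."""
    return mu_d + (beta_xz - beta_yz) * (z0 - z_bar)


s_z = math.sqrt(var_z)
print(f"explained sd = {explained_sd:.2f} kg")
for label, z0 in (("mean", z_bar), ("mean + 1 sd", z_bar + s_z),
                  ("mean + 2 sd", z_bar + 2 * s_z)):
    print(f"{label:12s} z0 = {z0:6.1f}  effect = {conditional_mean_effect(z0):.2f} kg")
```

The positive difference in slopes means the estimated advantage of T over R grows with baseline weight, which is what moves the conditional PIQI bounds.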

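If the distribution of $D$ at a given value of $Z$ is normal with mean and variance given by
(4) and (5), then bounds for the PIQI at a given $Z = z_0$ are given as

$$\mathrm{PIQI}^{\min}_{Z=z_0} = \Phi\left(-\frac{\mu_{D|Z=z_0}}{|\sigma_{X|Z} - \sigma_{Y|Z}|}\right), \quad \text{and} \quad \Phi\left(-\frac{\mu_{D|Z=z_0}}{\sigma_{X|Z} + \sigma_{Y|Z}}\right) = \mathrm{PIQI}^{\max}_{Z=z_0}. \qquad (7)$$

The quantities in (7) can be estimated from the regression of $X$ on $Z$ and of $Y$ on $Z$.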



Figure  1:  Plot  of  weight  loss  in  kilograms  (kg)  at  12  weeks  (positive  values  are  a  weight  loss) 
versus  baseline  weight  in  kg.  The  fitted  lines  are  from  regressing  X  on  Z  and  Y  on  Z.  The  vertical 
dashed  line  is  at  the  sample  mean  baseline  weight. 


Table 1 shows the estimated mean treatment effect for 3 values of baseline weight: the
mean baseline weight and one and two standard deviations above the mean baseline weight. The standard
deviation of baseline weight, $s_z$, was computed from all 69 baseline weights. Estimates of the
two conditional standard deviations in (7) are nearly the same, so that the estimated minimum
PIQI is not different from zero. The estimated maximum is shown in Table 1 along with standard
errors obtained from 2000 bootstrap samples within treatment groups.


Table 1: Estimated mean treatment effect and maximum PIQI (and standard error) at three
different values of baseline weight.

z_0 (kg)      Estimated mean effect at Z = z_0, kg (se*)      Estimated maximum PIQI at Z = z_0 (se*)
89.6 [1]      2.16 (0.780)                                     0.363 (0.052)
102.5 [2]     4.16 (1.256)                                     0.248 (0.067)
115.4 [3]     6.17 (2.021)                                     0.156 (0.078)

Footnotes: *The standard error estimates (se) are based on 2000 bootstrap samples.
[1] $z_0 = \bar{z}$; [2] $z_0 = \bar{z} + s_z$; [3] $z_0 = \bar{z} + 2s_z$.
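The maximum PIQI column of Table 1 can be approximately recovered from the upper bound in equation (7). The sketch below uses the conditional mean effects from the table together with the rounded upper bound of 6.1 kg quoted in the text for the conditional interaction standard deviation; because that bound is rounded, agreement with the tabled values is expected only to about two decimal places. Names are ours.

```python
from statistics import NormalDist

sigma_max_cond = 6.1  # rounded upper bound for sigma_{D|Z} quoted in the text

# (baseline weight z0 in kg, estimated mean effect at z0, reported maximum PIQI)
rows = [(89.6, 2.16, 0.363), (102.5, 4.16, 0.248), (115.4, 6.17, 0.156)]


def piqi_max(mu_cond, sigma_max):
    """Upper bound in equation (7): partial correlation set to -1."""
    return NormalDist().cdf(-mu_cond / sigma_max)


recomputed = [piqi_max(mu, sigma_max_cond) for _, mu, _ in rows]
for (z0, mu, reported), value in zip(rows, recomputed):
    print(f"z0 = {z0:6.1f}  recomputed = {value:.3f}  reported = {reported:.3f}")
```

Because the same upper bound is used at every $z_0$, the decline in the maximum PIQI across rows is driven entirely by the increase in the conditional mean effect.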

Some  conclusions  can  be  summarized  from  the  analysis  of  these  data  using  baseline 
weight  as  a  covariate. 

1. Data suggest some evidence of subject-treatment interaction in the population due
to a significant treatment-covariate interaction.

2.  Sometimes  transformations  are  sought  to  remove  interactions  so  that,  on  the 
transformed  scale,  more  comprehensive  statements  about  treatment  effects  can  be 
made.  However,  if  measurements  are  obtained  on  a  clinically  meaningful  scale  and 
subject-treatment  interaction  is  present  in  the  population,  it  cannot  be  removed  by 
transformations. 

3. Interactions like that shown above can highlight subpopulations that may respond
differently to a treatment. Although the estimated lower bound for PIQI at the three
values of baseline weight in Table 1 is not different from zero, the estimated upper
bound can be quite large. However, this estimated upper bound decreases for larger
values of baseline weight and, at two standard deviations above the mean baseline weight, the estimate
of the maximum PIQI is only two standard errors from zero. This suggests, based on
these data, that most individuals with a larger baseline weight should benefit from the
RGL treatment versus a more traditional low-fat diet, at least when assessed over a
12 week period. It is less clear whether the RGL diet would perform better for 12
week weight loss than the low-fat diet for individuals with a baseline weight below
the average weight of 89.6 kg.

4.  Additional  covariates  that  are  predictive  of  weight  loss,  regardless  of  whether  they 
interact  with  the  treatment,  can  tighten  the  estimated  bounds  for  PIQI  by  reducing  the 
estimated  conditional  variances  of  the  outcome  variables,  given  the  set  of  covariates. 

3.1.2.  A  categorical  covariate 

Analogous  results  to  those  described  above  for  a  continuous  covariate  can  be  derived  for 
a  categorical  one  as  well.  In  particular,  the  subject-treatment  interaction  variance  decomposes 
into  a  covariate-treatment  interaction  term  (the  explainable  component)  and  a  within  group 
variance  (an  unexplainable  component). 

A  slightly  different  approach  from  that  above  helps  facilitate  the  derivation.  Suppose  Z  is 
a  categorical  covariate  with  g  levels.  A  balanced  design  is  considered  here,  so  there  are  n  units 
per  group  for  a  total  of  ng  experimental  units.  Assume  as  before  that  a  bivariate  set  of  potential 
outcomes  are  randomly  generated  from  a  population  model,  and  denote  the  set  of  potential 
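outcomes as $(X_{ij}, Y_{ij})$, $i = 1, 2, \ldots, g$, $j = 1, 2, \ldots, n$. From the set of potential outcomes,
$D_{ij} = X_{ij} - Y_{ij}$ is an individual treatment effect and $\bar{D}_{i\cdot} = \frac{1}{n}\sum_{j=1}^{n} D_{ij} = \bar{X}_{i\cdot} - \bar{Y}_{i\cdot}$ is a mean treatment
effect within the $i$th level of $Z$, where $\bar{X}_{i\cdot} = (1/n)\sum_{j=1}^{n} X_{ij}$ and $\bar{Y}_{i\cdot} = (1/n)\sum_{j=1}^{n} Y_{ij}$. Define the
variance of these individual effects as,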

$$S_D^2 = \frac{\sum_{i=1}^{g}\sum_{j=1}^{n}(D_{ij} - \bar{D})^2}{ng - 1}, \qquad (8)$$

where $\bar{D} = \sum_{i=1}^{g}\sum_{j=1}^{n} D_{ij} / (ng)$. The quantity in (8) can be thought of as a finite population version of
$\sigma_D^2$ from the prior section. If $S_D^2 > 0$, then there is subject-treatment interaction present among
the $ng$ subjects in the sample. It can be shown that,

$$\frac{ng - 1}{g(n - 1)}\, S_D^2 = S_{D|Z}^2 + \frac{n\sum_{i=1}^{g}(\bar{D}_{i\cdot} - \bar{D})^2}{g(n - 1)}, \qquad (9)$$

where $\bar{D}_{i\cdot}$ is the mean treatment effect within the $i$th level of $Z$ and
$S_{D|Z}^2 = \sum_{i=1}^{g}\sum_{j=1}^{n}(D_{ij} - \bar{D}_{i\cdot})^2 / (g(n - 1))$ represents a within group variance of individual treatment
effects. Equation (9) shows that the components of $S_D^2$ include both subject-treatment interaction
within subsets, specified by $S_{D|Z}^2$, and a subset-treatment interaction term that is a function of
$\sum_{i=1}^{g}(\bar{D}_{i\cdot} - \bar{D})^2$. When $\sum_{i=1}^{g}(\bar{D}_{i\cdot} - \bar{D})^2 > 0$, the mean treatment effect within subsets varies
across the subsets. In the extreme case that $S_{D|Z}^2 = 0$, subject-treatment interaction in the set of
$ng$ subjects is completely explained by the interaction across subsets, which indicates a constant
individual effect of treatment T relative to treatment R within subsets. In the other extreme that
$S_{D|Z}^2 = S_D^2$, then $Z$ is not useful for predicting subsets of individuals (among the $ng$ individuals)
who may respond successfully to one treatment over the other.
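The decomposition in equation (9) is the usual partition of a total sum of squares into within-group and between-group parts, applied to the (unobservable) individual effects. The sketch below checks the identity numerically on simulated individual effects; the group means, the seed, and the function name are assumptions for illustration only, since real data never reveal the $D_{ij}$ directly.

```python
import random


def decomposition(d):
    """Return both sides of equation (9) for d, a list of g lists of n effects."""
    g, n = len(d), len(d[0])
    grand = sum(sum(row) for row in d) / (n * g)
    means = [sum(row) / n for row in d]
    # Total variance of individual effects, equation (8).
    s2_d = sum((v - grand) ** 2 for row in d for v in row) / (n * g - 1)
    # Within-group variance of individual effects.
    s2_within = sum((v - m) ** 2 for row, m in zip(d, means) for v in row) / (g * (n - 1))
    # Subset-treatment interaction term.
    between = n * sum((m - grand) ** 2 for m in means) / (g * (n - 1))
    lhs = (n * g - 1) / (g * (n - 1)) * s2_d
    return lhs, s2_within, between


# Simulated individual effects for g = 3 subsets with n = 10 subjects each.
rng = random.Random(0)
group_effects = [0.5, 2.0, 4.0]  # hypothetical subset mean effects
d = [[rng.gauss(mu, 1.0) for _ in range(10)] for mu in group_effects]

lhs, within, between = decomposition(d)
print(f"lhs = {lhs:.4f}, within + between = {within + between:.4f}")
```

Setting all subset means equal in the simulation drives the between term toward zero, which corresponds to a covariate that explains none of the heterogeneity.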

None of the quantities in equation (9) can be calculated from actual observed data post
treatment assignment, because not all potential outcomes are observable. However, a post
treatment assignment “estimate” for the second term in (9),

$$\frac{n\sum_{i=1}^{g}(\hat{\mu}_{D|Z=z_i} - \hat{\mu}_D)^2}{g(n - 1)},$$

is a scalar multiple of the usual sum of squares for the subset-treatment interaction term in a $2 \times g$ factorial
analysis of variance computation with $n/2$ observations for each treatment group combination.

Consequently, in an ANOVA model with weight loss as a response and treatment, $Z$, and a
treatment-$Z$ interaction as explanatory variables, an F-test for the contribution of the interaction
term may not only be used to diagnose the degree of subset treatment interaction, but also
provides evidence that $S_{D|Z}^2 < S_D^2$, and hence, $\sigma_{D|Z}^2 < \sigma_D^2$.

$S_{D|Z}^2$, the first term in equation (9), may be used to evaluate the PIQI within groups, as
before. If $D_{ij} \sim N(\mu_{D|Z=z_i}, \sigma_{D|Z}^2)$, $j = 1, \ldots, n$, then bounds for the PIQI at a given $Z = z_0$ are the
same as those given in equation (7), and the parameters in these bounds, $\mu_{D|Z=z_i}$, $\sigma_{X|Z}$,
and $\sigma_{Y|Z}$, can be estimated from sample statistics. One could estimate $\sigma_{X|Z}$ and $\sigma_{Y|Z}$ by pooling
sample variances of observed outcomes across groups or separately within each group. The latter
approach is equivalent to conducting a separate analysis within each subset. It is possible that
the bounds for $\mathrm{PIQI}_{Z=z_i}$ vary widely across subsets, with some subsets exhibiting the plausibility of
more treatment heterogeneity than others. Subsets with a positive estimated lower bound for the
subject-treatment interaction variance (and/or PIQI), or a small estimated upper bound, may
be particularly informative.

The  categorical  variable  available  in  the  illustrative  data  set  is  gender.  A  test  for  a  gender- 
treatment  interaction  was  not  significant,  indicating  that  the  effect  of  the  RGL  diet  with  respect 
to  the  low  fat  diet  was  not  estimated  to  be  different  across  genders,  or  that  gender  does  not 
explain  any  treatment  heterogeneity.  There  is  another  technique  that  has  been  proposed  to 
evaluate  the  potential  presence  of  treatment  heterogeneity.  This  is  the  proportion  of  similar 
response  (Rom  and  Hwang  1996;  Stine  and  Heyse  2001),  discussed  next. 

3.2 Proportion of Similar Response

Inman  and  Bradley  (1989)  provide  a  comprehensive  treatment  of  the  PSR  where  its 
calculation  is  defined  as  a  measurement  of  overlap  between  two  probability  density  functions 
(pdfs),  given  as 

$$\mathrm{PSR} = \int \min\left(f_X(x), f_Y(x)\right) dx, \qquad (10)$$

where $f_X(x)$ and $f_Y(x)$ are the pdfs of the outcome variables $X$ and $Y$ to treatments T and R,
respectively.  There  has  been  some  confusion  regarding  the  interpretation  of  the  PSR  (Inman  and 
Bradley  1989)  and  particular  disagreement  over  its  use  as  a  measurement  of  treatment 
heterogeneity  (Gastwirth  1975;  Senn  1997,  2006b).  The  overlap  of  the  density  curves  that  leads 
to  the  PSR  calculation  provides  a  natural  way  to  think  about  treatment  heterogeneity  and  IQI.  In 
support  of  such  usage,  Gastwirth  (1975)  noted  that  the  maximum  values  for  the  PSR  and  the 
P(X  <  Y)  =  P(D  <  0)  which  are  1  and  0.5,  respectively,  have  a  similar  interpretation. 
Additionally, the overlap seems to suggest that as the PSR increases, the potential for a value
from $f_X(x)$ to be less than a value from $f_Y(x)$ also increases. However, in an assessment of the
PSR,  Senn  (2006b,  pp.  3944-3945)  points  out,  “If  every  patient  benefits  by  having  his  or  her 
outcome  improved  by  the  same  amount  [under  treatment  T]  compared  to  what  it  would  have 
been  [under  treatment  R],  then  100  percent  of  the  patients  have  benefited”  (brackets  added  to 
provide  context  for  the  notation  herein).  Thus  Senn  identifies  what  is  clear  using  potential 
outcomes,  which  is,  if  D  is  a  constant,  erj  =  0  and  the  PIQI  =  0,  even  when  the  PSR  >  0. 

The calculation of the PSR depends on values of x such that $f_X(x) = f_Y(x)$. If the two distributions are equal, $f_X(x) = f_Y(x)$ for all x and the PSR is equal to 1. If $f_X(x) \neq f_Y(x)$ for any x, the PSR is equal to 0. For clarity of illustration, assume that X and Y follow a bivariate normal distribution throughout, and, without loss of generality, assume $\mu_X > \mu_Y$, $\sigma_X^2 = \sigma^2$, and $\sigma_Y^2 = k^2\sigma^2$ for the remainder of this entry. When $k \neq 1$ there will be exactly two finite points of equality, $x_L$, $x_U$ with $x_L < x_U$, where $f_X(x_L) = f_Y(x_L)$ and $f_X(x_U) = f_Y(x_U)$. Both $x_L$ and $x_U$ result from

$$\frac{(\mu_Y - k^2\mu_X) \pm \sqrt{k^2(\mu_X^2 + \mu_Y^2 - 2\mu_X\mu_Y) - k^2\,2\sigma^2 \ln(k)\,(1 - k^2)}}{1 - k^2}. \tag{11}$$

A similar representation for the points of equality can be found in Inman and Bradley (1989). The PSR can be calculated by adding three probabilities shown in equation (12).

$$\mathrm{PSR} = P(X < x_L) + P(x_L < Y < x_U) + P(X > x_U) \tag{12}$$

When $k = 1$, $f_X(x) = f_Y(x)$ at a single value $x_L = (\mu_X + \mu_Y)/2$. The calculation of the PSR is then simplified to

$$\mathrm{PSR} = P(X < x_L) + P(Y > x_L) = 2 \times \Phi\!\left(\frac{\mu_Y - \mu_X}{2\sigma}\right). \tag{13}$$
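
As a rough illustration of equations (11) through (13), the following Python sketch evaluates the PSR for two normal outcome distributions. The function name `psr_normal` and the example parameter values are hypothetical and not part of the original analysis. Equation (12) as written corresponds to k > 1; the sketch assumes the roles of X and Y are swapped when k < 1.

```python
import math
from statistics import NormalDist

def psr_normal(mu_x, mu_y, sigma, k):
    """PSR for X ~ N(mu_x, sigma^2) and Y ~ N(mu_y, (k*sigma)^2)."""
    X = NormalDist(mu_x, sigma)
    Y = NormalDist(mu_y, k * sigma)
    if math.isclose(k, 1.0):
        # Equation (13); the absolute value drops the mu_x > mu_y convention.
        return 2.0 * NormalDist().cdf(-abs(mu_x - mu_y) / (2.0 * sigma))
    # Equation (11): the two points where the densities are equal.
    root = math.sqrt(k**2 * (mu_x - mu_y) ** 2
                     - 2.0 * k**2 * sigma**2 * math.log(k) * (1.0 - k**2))
    a = ((mu_y - k**2 * mu_x) + root) / (1.0 - k**2)
    b = ((mu_y - k**2 * mu_x) - root) / (1.0 - k**2)
    x_l, x_u = min(a, b), max(a, b)
    # The less dispersed density is the smaller one in both tails,
    # the more dispersed density is the smaller one between x_l and x_u.
    narrow, wide = (X, Y) if k > 1 else (Y, X)
    # Equation (12), with X and Y exchanged when k < 1 (an assumption here).
    return (narrow.cdf(x_l)
            + (wide.cdf(x_u) - wide.cdf(x_l))
            + (1.0 - narrow.cdf(x_u)))

# Hypothetical parameter values, chosen only for illustration.
psr_example = psr_normal(mu_x=1.0, mu_y=0.0, sigma=1.0, k=1.5)
```

The result can be cross-checked against the `overlap` method of `statistics.NormalDist` in the Python standard library, which computes the same overlapping area for two normal curves.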

The  following  proposition  establishes  a  relationship  between  the  PSR  and  the  PIQI,  with  the 
details  of  the  derivation  given  in  the  appendix. 

Proposition 1 Assuming the bivariate normal distribution described earlier,

$$\mathrm{PIQI}^{\max} \geq \frac{\mathrm{PSR}}{2}, \tag{14}$$

with equality at $k = 1$.
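
The inequality in Proposition 1 can be checked numerically. The sketch below is illustrative only: it computes the maximum PIQI as $P(X < x_{-1})$, following the appendix, and uses the standard library's `NormalDist.overlap` as a stand-in for the PSR. The function names and the grid of k values are assumptions made for the example.

```python
from statistics import NormalDist

def piqi_max(mu_x, mu_y, sigma, k):
    """P(X < Y) when corr(X, Y) = -1, i.e. P(X < x_{-1}) as in the appendix."""
    x_m1 = (mu_y + k * mu_x) / (1.0 + k)   # point where the line y = x is crossed
    return NormalDist(mu_x, sigma).cdf(x_m1)

def psr(mu_x, mu_y, sigma, k):
    """Overlap of the two marginal normal densities (standard-library helper)."""
    return NormalDist(mu_x, sigma).overlap(NormalDist(mu_y, k * sigma))

def check_grid(mu_x=1.0, mu_y=0.0, sigma=1.0, ks=(0.25, 0.5, 1.0, 2.0, 4.0)):
    """Return (k, PIQI^max, PSR/2) for each k in a hypothetical grid."""
    return [(k, piqi_max(mu_x, mu_y, sigma, k), psr(mu_x, mu_y, sigma, k) / 2.0)
            for k in ks]

rows = check_grid()
```

Each row of `rows` should show the second entry at least as large as the third, with the two coinciding at k = 1.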

A similar result holds for subpopulations defined by either a continuous or a categorical covariate, Z. The conditional PSR is defined using the conditional distributions of X and Y given the observed covariate $z_0$ so that

$$\mathrm{PSR}_{z_0} = \int \min\bigl(f_{X|z_0}(x),\, f_{Y|z_0}(x)\bigr)\,dx. \tag{15}$$

As with the PSR, the relationship between $\mathrm{PSR}_{z_0}$ and $\mathrm{PIQI}_{z_0}$ depends on whether $\sigma^2_{X|Z} = \sigma^2_{Y|Z}$. Let $\sigma^2_{Y|Z} = k_z^2\,\sigma^2_{X|Z}$, where $k_z^2$ is the conditional $k^2$. Given conditional distributions $f_{X|z_0}$ and $f_{Y|z_0}$ at a finite value of $z_0$, then

$$\mathrm{PIQI}^{\max}_{z_0} \geq \frac{\mathrm{PSR}_{z_0}}{2}, \tag{16}$$

with equality holding at $k_z = 1$. This result follows directly from the proof of Proposition 1. We are not aware of a similar result relating the PSR to the minimum bound for a PIQI.

Figure 2 illustrates the estimated PSR for the data example. The first panel shows the unconditional PSR and the other three show the estimated conditional PSR at the three values of $z_0$ given in Table 1. The estimated values of (1/2)PSR for the four plots in Figure 2 are very close to those reported earlier for the estimated maximum PIQI (e.g., see Table 1) because the values of k (and $k_z$) are very close to 1. The estimated PSR decreases with increasing baseline weight, a result of the treatment-by-baseline-weight interaction. Standard errors for an estimated PSR can be obtained using bootstrap samples, and in the data example were similar to those reported earlier for the estimated $\mathrm{PIQI}^{\max}$.

Figure 2: Illustration of the PSR using only the marginal distributions of X and Y (panel (a)) and at three values of baseline weight given in Table 1 (panels (b) to (d)).
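
The bootstrap standard error for an estimated PSR mentioned above might be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used for the data example: the PSR is estimated by plugging sample means and standard deviations into normal curves, each arm is resampled separately with replacement, and the data are simulated. The names `psr_hat` and `bootstrap_se` are hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev

def psr_hat(x, y):
    """Plug-in PSR: overlap of normal curves fitted to each arm (normality assumed)."""
    return NormalDist(mean(x), stdev(x)).overlap(NormalDist(mean(y), stdev(y)))

def bootstrap_se(x, y, n_boot=500, seed=0):
    """Standard deviation of the plug-in PSR over bootstrap resamples."""
    rng = random.Random(seed)
    reps = []
    for _ in range(n_boot):
        xb = rng.choices(x, k=len(x))   # resample arm T with replacement
        yb = rng.choices(y, k=len(y))   # resample arm R with replacement
        reps.append(psr_hat(xb, yb))
    return stdev(reps)

# Hypothetical data, simulated only for illustration (not the paper's data).
rng = random.Random(1)
x = [rng.gauss(5.0, 2.0) for _ in range(60)]
y = [rng.gauss(4.0, 2.2) for _ in range(60)]
est, se = psr_hat(x, y), bootstrap_se(x, y)
```

A conditional version would apply the same resampling to fitted conditional distributions at a chosen $z_0$, which requires a regression model that this sketch does not attempt.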


3.3  Cross-Over  Designs 

Perhaps the most straightforward design for estimating $\sigma_D^2$ and the PIQI is a cross-over design. A large body of literature exists on estimating mean treatment effects, mean period effects, carry-over effects, etc. (e.g., Senn 2006a; Yang and Stufken 2008). Mixed-effects models fit to data from a cross-over design with a random subject effect may even compute what some have referred to as a “subject-treatment interaction variance” (e.g., Hauck et al. 2000; Endrenyi and Tothfalusi 1999). However, this variance computed from observed data may not equal a variance of true individual effects without certain assumptions and/or depending upon how one defines an individual effect in multiple period designs. We illustrate concepts for a two-period, two-treatment cross-over design, assuming no carry-over effects but that period effects may vary across individuals. Potential outcomes are $(X_1, Y_1)$ at period 1 and $(X_2, Y_2)$ at period 2, and these are rewritten as $(X - t,\, Y - \tau)$ at period 1 and $(X + t,\, Y + \tau)$ at period 2. Now the pair $(X, Y)$ quantifies a mean response over the two periods on each of the two treatments, and the pair $(t, \tau)$ accommodates effects from period 1 to period 2 for each outcome. There are two “true” individual treatment effects given by $D_1 = (X - Y) - (t - \tau)$ at period 1 and $D_2 = (X - Y) + (t - \tau)$ at period 2. In some applications it may be $D_1$ that is the effect of most interest. Another effect may be defined as the average over the two time periods, denoted as $\bar{D} = (D_1 + D_2)/2$. The true individual effect is constant across the two time periods if $t - \tau = 0$, an assumption that may be reasonable with no carry-over effects.

Since each individual is crossed over from one treatment to another after a washout period, an individual treatment effect may seem to be observable. Assume that $n_1$ subjects are randomly assigned to the sequence TR, where TR implies treatment T at time 1 and treatment R at time 2, and $n_2$ subjects to the reverse sequence of treatments, RT, with $n_1 + n_2 = n$. The observed differences are $d_j$, for $j = 1, 2, \ldots, n$, and can be written as $d_j = (X_j - Y_j) - (t_j + \tau_j)$ if the jth subject was assigned to sequence TR, and $d_j = (X_j - Y_j) + (t_j + \tau_j)$ if assigned to sequence RT.

A straightforward naive estimate of the PIQI may be obtained using equation (2), with $\bar{d} = \frac{1}{n}\sum_{j=1}^{n} d_j$ to estimate $\mu_D$ and $s_d^2 = \frac{1}{n-1}\sum_{j=1}^{n}(d_j - \bar{d})^2$ to estimate $\sigma_D^2$. Following results analogous to Gadbury (2001) it can be shown that $s_d^2$ is positively biased for $\mathrm{VAR}(\bar{D}) = \sigma_D^2$ when an individual effect is defined by $\bar{D} = (D_1 + D_2)/2$, and the bias term is a function of $(t_j + \tau_j)$, $j = 1, \ldots, n$. If $(t_j + \tau_j)$ is assumed to be constant across individuals (i.e., a constant sequence effect), then the bias can be estimated from observed data. If this is not assumed, but it is assumed that the true individual effect is constant across periods, meaning that $t_j = \tau_j$ for $j = 1, \ldots, n$, then the bias can be estimated if $t_j$ is constant across individuals, i.e., a constant period effect. If period effects vary across individuals, but $t_j = \tau_j$ for all j, then the bias can be estimated with an extension to the design such as the Balaam design (Balaam 1968), where some subjects remain on the same treatment over the two periods (i.e., TR, RT, TT, and RR sequences).
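
A small simulation can make the positive bias of the naive variance estimate concrete. The sketch below is only illustrative: the normal distributions, the parameter values, the independence of all components, and the alternating assignment to sequences (standing in for randomization) are assumptions made for the example, and the function name `simulate_crossover` is hypothetical.

```python
import random
from statistics import variance

def simulate_crossover(n=20000, sd_period=1.0, seed=0):
    """Compare the naive s_d^2 with the sample variance of the true effects X - Y."""
    rng = random.Random(seed)
    d_obs, d_true = [], []
    for j in range(n):
        # Period-averaged potential outcomes under T and R (assumed independent).
        x, y = rng.gauss(2.0, 1.0), rng.gauss(1.0, 1.0)
        # Individual period effects for each outcome (assumed distributions).
        t, tau = rng.gauss(0.5, sd_period), rng.gauss(0.5, sd_period)
        d_true.append(x - y)                   # (D1 + D2) / 2 = X - Y
        if j % 2 == 0:                         # sequence TR
            d_obs.append((x - t) - (y + tau))  # (X - Y) - (t + tau)
        else:                                  # sequence RT
            d_obs.append((x + t) - (y - tau))  # (X - Y) + (t + tau)
    return variance(d_obs), variance(d_true)

s2_naive, var_true = simulate_crossover()                 # varying period effects
s2_const, var_const = simulate_crossover(sd_period=0.0)   # constant t + tau
```

In both settings the naive variance should exceed the variance of the true averaged effects. With `sd_period=0.0` the excess comes only from the constant sequence effect, which is the case where the text notes the bias can be estimated from observed data.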

Required assumptions for direct estimation of $\sigma_D^2$ may be more plausible in certain applications than assumptions that are required without the multiple period feature of the design. Even with no additional assumptions, estimated bounds for $\sigma_D^2$ (or PIQI) may be tighter than those obtained from single period designs. Repeated measures cross-over designs have
advantages  over  single  period  designs  for  estimating  subject-treatment  interaction  and  its 
consequences  (e.g.,  Senn  2001).  More  methodological  development  is  needed  to  define  the 
required  assumptions  and  resulting  estimators  from  different  types  of  cross-over  designs,  and 
potential  outcomes  may  be  the  best  structure  to  use  when  doing  this. 

Cross-over  designs,  however,  are  not  always  practical  to  implement  in  many  applications 
(cf.,  Brown  1980;  Senn  2001).  For  instance,  in  applications  like  the  data  example  used  herein, 
there  may  be  limitations  in  using  cross-over  designs  when  the  primary  outcome  variable  is 
weight  loss.  The  true  individual  effect  of  a  treatment  at  time  1  may  be  substantially  larger  than  at 
time  2  because  people  tend  to  lose  weight  more  rapidly  at  first,  and  substantial  carry-over  effects 
may  be  likely  as  well. 


4.  DISCUSSION  AND  CONCLUSIONS 

In 1892 Sir William Osler stated, “If it were not for the great variability among individuals medicine might as well be a science and not an art” (extracted from Roses 2000, page 857). Thus the topics discussed here have been long recognized as important considerations
when  selecting  a  “best”  treatment  for  an  individual.  Individual  treatment  heterogeneity  and  its 
consequences  should  be  an  important  consideration  when  designing  clinical  trials  and 
interpreting  treatment  efficacy  and  safety  for  a  target  population.  The  quantities  discussed  herein 
may  also  inform  the  pursuit  of  pharmacogenetic  research  which  seeks  to  identify  genomic 
predictors  of  response  to  treatments  (e.g.,  Hu  et  al.  2006).  It  is  often  poorly  understood  how 
much heterogeneity might be present in the first place and whether a search for such gene-treatment interactions that explain this heterogeneity will be fruitful (Senn 2001). Perhaps we
should  invest  most  readily  in  finding  genetic  factors  influencing  variability  in  treatment  response 
for  those  treatments  for  which  we  have  actually  demonstrated,  rather  than  merely  presumed, 
large  variability  in  response. 

Evaluating  the  plausible  variance  in  treatment  response,  and  even  more  so  the  proportion 
of  a  population  with  an  IQI,  has  other  applications  as  well.  The  Latin  enjoinder  primum  non 
nocere  (above  all,  do  no  harm)  frequently  (mis)attributed  to  Hippocrates  (Smith  2005)  remains  a 
mainstay  of  medical  thinking  today.  Thus,  regulatory  agencies  such  as  the  US  Food  and  Drug 
Administration  may  wish  to  know  not  only  the  average  effect  of  a  drug  compared  to  placebo,  but 
the  probability  that  it  will  have  a  poorer  effect  than  no  drug.  Similarly,  when  faced  with  the 
possibility  of  approving  a  new  drug  that  is  no  more  efficacious  on  average  than  an  existing  drug 
which  has  been  widely  used  and  which  has  already  survived  the  baptism  of  fire  that  is 
widespread  clinical  use,  it  is  tempting  to  ask  “why  do  we  need  to  approve  this  new  drug  if  it  is  no 
more  efficacious  than  the  old  drug  we  know  and  trust?”  A  typical  response  is  one  voiced  by  the 
director  of  the  UK’s  National  Institute  for  Health  and  Clinical  Excellence’s  health  technology 
evaluation  centre,  who  recently  stated,  “Different  people  respond  in  different  ways  to  treatment, 
and  the  committee  heard  from  clinical  experts  and  patients  about  the  importance  of  having 
multiple  options  available”  (Mayor  2010).  That  is,  it  is  often  presumed  that  although  drug  A  may
be  no  better  than  drug  B  on  average,  for  some  persons,  drug  A  works  better  than  does  drug  B  and 
vice  versa  for  other  persons.  If  this  is  the  case,  then  having  multiple  drugs  on  the  market  may  be 
important  even  if  there  is  no  difference  in  their  effects  on  average  and  they  cannot  be  used  in 
combination.  However,  rather  than  accepting  the  premise  as  true  a  priori,  the  results  shown 
herein  may  help  lead  to  new  ideas  for  the  evaluation  of  the  plausibility  and  frequency  of  such 
IQIs. 

The  ideas  may  also  be  useful  in  evaluating  advertising  claims.  Consider  the  context  of 
claims for weight loss products, which can often be quite extravagant. The US Federal Trade Commission (FTC) states that “No [weight loss] product will work for everyone,” and therefore claims implying that a “product causes substantial weight loss for all users” are a likely sign of fraud (FTC statement). Is there evidence a company could provide to the FTC to show that, in its randomized clinical trial (RCT) showing a positive mean effect, the plausible proportion of people who will have an effect less than a threshold r is negligible? Alternatively, is there evidence that the FTC could muster to show a company that its claim of a universal positive effect is almost certainly untrue despite there being a positive mean effect? Again, the results described
herein  may  help  clarify  the  issues  involved  when  answering  these  questions. 

Finally,  one  can  imagine  applications  in  legal  settings  (see,  for  example,  Marchant  2001, 
2010).  Imagine  that  a  plaintiff  (e.g.,  a  consumer)  sues  a  defendant  (e.g.,  a  distributor  of  a  drug, 
food,  or  pharmaceutical)  claiming  that  use  of  defendant’s  product  caused  a  stroke  secondary  to 
markedly  elevated  blood  pressure  (BP)  as  a  result  of  using  the  product.  Imagine  further  that 
defense  experts  present  evidence  that  well-designed  RCTs  show  an  average  effect  of  the  product 
on BP to be less than or equal to zero. Plaintiff's experts reply that there is great interindividual variability in response and even though the average response is less than or equal to zero, some people will be hypersensitive hyper-responders with extreme BP increases. What evidence can the court bring to bear on the question of how probable it is that plaintiff was such a hyper-responder? The first question which must precede this is what evidence is there that hyper-responders in the opposite direction even exist and with what frequency? The techniques herein may provide a plausible range of answers.

Potential  outcomes  are  a  natural  way  to  define  individual  treatment  effects  and  metrics 
that  quantify  treatment  heterogeneity  as  well  as  the  risk  of  a  qualitative  interaction  across 
individuals  or  groups  of  individuals.  Existing  techniques  that  seek  to  evaluate  treatment 
heterogeneity  have  limitations,  and  these  limitations  are  made  clear  by  potential  outcomes.  The 
potential  outcomes  framework  can  delineate  heterogeneity  that  is  observable  from  that  which  is 
not,  and  unobservable  heterogeneity  can  often  be  bounded  by  quantities  that  can  be  estimated  in 
observed  data.  Thus,  the  potential  outcomes  framework  is  a  useful  complement  to  existing 
techniques  to  evaluate  treatment  heterogeneity  and  qualitative  interactions,  and  they  should  be 
used  as  such  when  analyzing  data  from  randomized  trials.  Their  use  may  also  suggest  new 
directions  in  the  design  of  randomized  trials  -  directions  that  do  not  compromise  estimation  of 
mean  effects  but  also  allow  for  more  direct  evaluation  of  treatment  heterogeneity.  Eventually, 
perhaps,  reporting  of  treatment  heterogeneity  and  risks  of  qualitative  interactions  (at  either  the 
individual  or  group  level),  in  addition  to  summary  measures  such  as  mean  effects,  will  be  a  more 
standard  practice,  and  a  response  to  a  perceived  need  that  has  been  recognized  by  others  in  recent 
years  (cf.,  Longford  1999). 


APPENDIX: Proof of Proposition 1.

The equality at $k = 1$ is straightforward. Let $k > 1$. When $\rho_{XY} = -1$, the (x, y) pairs are constrained to the line $y = \mu_Y + k\mu_X - kx$ with probability one. If we let x and y be equal and set to the common value $x_{-1} = \frac{\mu_Y + k\mu_X}{1 + k}$, then it can be shown that

$$\mathrm{PIQI}^{\max} = P(X < x_{-1}) = P(Y > x_{-1}). \tag{17}$$

Therefore,

$$\begin{aligned}
2 \times \mathrm{PIQI}^{\max} &= P(X < x_{-1}) + P(Y > x_{-1}) \\
&= P(X < x_L) + P(x_L < X < x_{-1}) + P(x_L < Y < x_U) - P(x_L < Y < x_{-1}) \\
&\quad + P(Y > x_U) + P(X > x_U) - P(X > x_U) \\
&= P(X < x_L) + P(x_L < Y < x_U) + P(X > x_U) + P(x_L < X < x_{-1}) \\
&\quad - P(x_L < Y < x_{-1}) + P(Y > x_U) - P(X > x_U) \\
&= \mathrm{PSR} + P(x_L < X < x_{-1}) - P(x_L < Y < x_{-1}) + P(Y > x_U) - P(X > x_U).
\end{aligned}$$

Thus, $\mathrm{PIQI}^{\max} \geq \mathrm{PSR}/2$ because $P(x_L < X < x_{-1}) \geq P(x_L < Y < x_{-1})$ and $P(Y > x_U) \geq P(X > x_U)$ when $k > 1$. Proof for $k < 1$ is similar. ■